This handout covers the topics that we will go through during the lab session.

There are two main sections in this handout, required and optional.

In the required section, we are going to set up R and RStudio on your computer. Learn how to use R packages and create R Markdown files. Then, we are going to go through the basics of R programming by analyzing a simple biological data set.

In the optional section, we are going to run Linux operating system on a virtual machine.

1 Required

1.1 Install R programming language

  1. Go to https://www.r-project.org/.

  2. Click download R in the Getting Started section.

  1. Click https://cloud.r-project.org/, the first link under 0-Cloud. The URLs to the mirrors of Comprehensive R Archive Network (CRAN) are listed in this page. A mirror is a replicate of the contents of the main server, of which the URL is http://cran.r-project.org. Therefore, you can choose any of the URLs listed if you want.
  1. Choose the download link based on your operating system.
    • Linux: Choose your Linux distribution and follow the installation instructions.
    • Mac OS X: Download R-3.4.3.pkg. Open the downloaded file to install.
    • Windows: Choose base, then Download R 3.4.3 for Windows. Open the downloaded file to install.

1.2 Install RStudio

  1. Go to https://www.rstudio.com/products/rstudio/download/#download.
  2. Choose the installer to download based on your operating system. For Linux users, check if you are running 32-bit or 64-bit system.
  3. Open the downloaded file to install. If you are running Ubuntu, you might need to install by using the gdebi utility, and you can find more information at https://medium.com/@GalarnykMichael/install-r-and-rstudio-on-ubuntu-12-04-14-04-16-04-b6b3107f7779.

Note on 32-bit and 64-bit: Most of the time, we only need to know this information when trying to install or run some software. 32-bit or 64-bit refers to the size of register in the central processing unit (CPU). 32-bit system allows direct access of \(2^{32}\) bytes of memory, and 64-bit allows direct access of \(2^{64}\) bytes of memory.

1.3 Run RStudio

  1. Open the installed RStudio. If all went well, there should no error message, and there should be several panes.
  2. Find the pane with the Console tab in it and select that tab, which is the left pane by default if you have not opened any script yet. The console runs an R programming language interpreter, which executes the R code for you.
  3. Select the Console pane by clicking on its inner area. You should see a cursor blinking after the prompt >.
  4. Type print("Hello World!") and hit enter. You should see the following prompt in the next line [1] "Hello World!". An R base function print() is called with the positional parameter "Hello World". After this line of code is executed, the parameter is printed by the computer to the console.

Even though this line of code is simple, there are actually many things happened behind the scene:

  1. RStudio starts running. RStudio is an integrated development environment (IDE) for R programming language. RStudio is written mainly in Java, C++, and JavaScript. You can look at their source code stats if you are interested by going to the following link, https://github.com/rstudio/rstudio.

  2. RStudio runs R programming language interpreter and shows its output in the Console tab. The R programming language interpreter is basically an infinite loop with memory of executions within the same session:
    • Prompt >.
    • Takes in code expression.
    • Executes the code expression. When execution is done, prompt > again and ready for the next code expression. However, if one or more variables are created or updated, the interpreter will remember them.
  3. You typed in print("Hello World!") and “enter” in the R interpreter.

  4. When R interpreter sees the “enter”, it checks your code before the “enter”. If you wonder why this is necessary, try type print("Hello World!" and “enter”. Note that the ) is missing at the end. Then, at the next line with the prompt +, type ) and “enter”.

  5. R interpreter found a complete code expression and starts executing the code without the “enter”.

  6. R interpreter starts executing print("Hello World!"), which is a function to print data out. At this point, the prompt [1] "Hello World!" is printed out. Question: why is there [1] before "Hello World"? Hint: try to run print(1:100), where 1:100 creates a vector of integers from 1 to 100.

  7. print("Hello World!") function returns "Hello World!", but there is no variable assigned to that return value. Try to run myVariable <- print("Hello World!") and see what is value in myVariable. You can see it either by typing myVariable and “enter” in the console, or find it in the Environment tab of another pane. Environment is basically a collection of variables.

  8. R interpreter evaluation loop resumes running to take in the next code expression. If you would like to see the suspension of R interpreter evaluation loop, try to run Sys.sleep(10) which asks the interpreter to idle for 10 seconds. You should see a blinking cursor at the beginning of the line. Try:
    • When the evaluation loop suspends, type in print("Hello World!") and “enter”.
    • When the evaluation loop suspends, hit “esc”.

1.4 Install and use packages

An R package is essentially a collection of code for certain purposes, such as performing differential expression analysis (DESeq2), convert R markdown to HTML/PDF/Word (Knitr), and general plotting (ggplot2).

In order to better share the code, an R package usually contains the following things as well:

  • Tutorials to help users to get quickly started.
  • Reference manuals containing the details about how to use the code for your own analysis.

In this section, we will go through the following:

  • Install packages:
    • Packages from CRAN using the install.packages() function.
    • Packages from Bioconductor using the biocLite() function.
  • Use packages:
    • Import packages using library() function.
    • Call functions in packages without importing using the libraryName::functionName syntax.
  • Get help from documentations

1.4.1 Install packages

1.4.1.1 CRAN

To install from CRAN, run the following code:

install.packages("ggplot2")
install.packages("dplyr")

If you are calling it for the first time, you might be prompted to select the mirrors, which is the same as downloading R programming language. You can just choose the same one as above http://cran.r-project.org.

install.packages() is the function to install package. The first parameter "ggplot2" is the name of the package.

1.4.1.2 Bioconductor

To install from CRAN, run the following code:

# Run the script at https://bioconductor.org/biocLite.R , which defines the 
# function `biocLite()`.
# You can use your browser to see the code by opening the link. 
source("https://bioconductor.org/biocLite.R")

# Install core packages
biocLite()

# Install a specific package named DESeq2
biocLite("DESeq2")

Note: # starts a comment, # and everything afterwards within the same line are not going to be executed.

1.4.1.3 Overview

For both CRAN and Bioconductor packages, the installing process of a package includes basically two steps:

  1. Download the package from CRAN or Bioconductor to your computer.
  2. Set up downloaded package(s) in one or more locations on your computer. These locations are called the library of your R environment, which can be checked using function .libPaths().

You might wonder why couldn’t we just have one repository. Just list a few reasons:

  • CRAN and Bioconductor developer communities are distinct. From my impresion, the CRAN community focus more on general usage such as plotting and data manipulations, whereas the Bioconductor community focus more on analyzing a specific type of biological data.

  • R packages are peer-reviewed by the community. A general purpose package developer might not be able to review a package specifically designed for biological data.

  • Publishing rules of Bioconductor are more strict than CRAN, which guarentee certain minimum level of package quality.

1.4.2 Use packages

The code for using CRAN and Bioconductors is the same. You can either import the whole package to your environment using function library(), or execute certain function within a package using the packageName::functionName syntax.

# Run functions using the :: operator
# NOTE: ggplot2::mpg refers to the dataset `mpg` in the package `ggplot2`

ggplot2::ggplot(ggplot2::mpg, ggplot2::aes(displ, hwy, colour = class)) + 
  ggplot2::geom_point()

# Import an installed CRAN package
library("ggplot2")

# Import an installed Bioconductor package
# Note that the quotes are optional
library(DESeq2)

# Run imported functions from ggplot2
# You will get an error if you run this without importing ggplot2
ggplot(mpg, aes(displ, hwy, colour = class)) + 
  geom_point()

You might wonder why we still need to import a library after installing it. The intuition is an analogy to installing a software on your computer. After you install a software, such as RStudio, you might not want to open it and keep it constantly running at the backgorund after starting your computer, because that will slow down your computer and make your working space crowded. When R interpreter starts running, it only has a minimum set of data and functions defined, which is similar to a just started computer. If you want to use the R interpreter to do something else, you need to get those data and functions into your environment, which is similar to open an application, such as browser, on your computer.

However, this analogy does not explain the packageName::functionName syntax, and we need more details to understand what really happened.

Recall that a package is a collection of code structured into a directory located in the R library on your computer. Following is a screen shot of one R library location:

R library

R library

Each folder is an installed package with the name of corresponding package, and the ggplot2 package is the first one in the list. The content of ggplot2 is structured within the folder. Following is the screenshot of the files within the ggplot2 folder:

ggplot2 package directory structure

ggplot2 package directory structure

All packages are structured in the same way but with different content in these files. The code of a package is stored in the R directory, and the datasets are in the data directory. Other files and directories contain documentations, license, citation, etc.

Therefore, using a package, either through importing or :: operator, is to ask R interpreter to use the stored code or datasets. However, there are important distinctions between these two methods.

Importing a packages loads all code and datasets into your R interpreter, so that you can call the package defined functions and use its datasets directly. However, importing will not execute the loaded code.

On the other hand, packageName::functionName does not load anything into your R interpreter but just execute the function. After the executed function returned, the function will not be loaded.

Because R interpreter starts with only a minimum set of packages imported, if you want to use a function or dataset within a package that has not been imported, you can either import that package and use it directly afterwards, or use it through :: operator.

1.4.3 Get help from the documentations

You can get help from two types of documentations, tutorial (vignette) and reference manual. A tutorial helps a beginner to quickly get started using the package, but a reference manual lists all the details about the code and datasets included in the package for a more experienced user to look up.

You can get both tutorial and manual either using RStudio or online.

1.4.3.1 Get documentations using RStudio

There are R functions to look up the package documentations in your library (recall the figure of package structure). The function vignette("packageName") looks up for tutorial, and help("functionName") looks up for the manual section of that function. After the function being executed by your R interpreter and the specified documentation was found, the documentation will show up in the help tab in one of your RStudio panes. You will receive an error in the console tab if the specified documentation was not found.

Following is a demo of how to use vignette() and help() functions.

# I want to see the the vignette of DESeq2
# This does not require DESeq2 to be imported
vignette("DESeq2")

# I want to get help on the function DESeq() in DESeq2
# This requires DESeq2 to be imported. If we did not import DESeq2 previously, 
# you will receive an error message. 
help(DESeq)

# However, if you do not want to import DESeq but just want to see the documentation
# of DESeq(), you can use the following code
help(DESeq, package = "DESeq2")

# If you want to know more about these two functions, check their manuals
help(vignette)
help(help)

1.4.3.2 Get documentations online

Google the package and go to either CRAN or Bioconductor depending on the package source. Find the documentations listed in the webpage. Following are the screenshots of the location of the documentation:

The first documentation of DESeq2 on Bioconductor Analyzing RNA-seq data with DESeq2 is the tutorial.

Documentations are usually very helpful, but also keep in mind that Google is also your best friend when learning programming. There are many helpful Q&As on StackOverflow and Biostars. StackOverflow community is focused on general programming and statistics, whereas Biostars community is focused on bioinformatics.

1.5 R Markdown

R Markdown is a markup language for formatting an analysis report written in R, which allows you to embed your R analysis code into a document explaining your analysis.

The R package knitr converts R Markdown files into generate easy-to-read report in HTML, PDF, or Word format.

To start using R Markdown in RStudio:

  1. Click menu File -> New File -> R Markdown. Then, you will be prompted by a wndow to fill in some relevant parameters. In the Default Output Format: option, select HTML (important), which is the standard markup language for creating web pages. It worth taking time to read the explanations of each option. After clicking OK, a new R Markdown file will show up in a pane of RStudio.

  2. Click the Knit option at the top of the Markdown file pane. Then, you will be prompted to save the R Markdown file. If you hit Knit for the first time, you will be prompted to install several packages that are required to run the “knitting” process. Install them.

  3. Give the file a name such as test and save it anywhere you prefer. Then, you will see an R Markdown tab show up in the Console pane, and it starts running. When running, knitr executes all R code and generates an HTML report. After it is done, a new window containing the generated HTML report will show up. The report file will also be stored in the same directory as the R Markdown file.

If you want to generate your report in PDF, you will need to install a extra tool on your computer outside of RStudio. This tool, TeX, helps knitr to convert R Markdown into PDF. In order to get this tool, click the tiny triangle at the right side of Knit option and select Knit to PDF, and you will receive an error message in the R Markdown tab. Follow the instructions in the error message carefully to install the TeX distribution.

We will expore the features of R Markdown in the next section.

1.6 R basic operations

In this section, we will go through the basics of R programming language by analyzing a simple and fake biological dataset.

The dataset is collected in an experiment to study the correlation between the expression level of a gene and the proliferation rate of a cell line. In the hypothesis, this gene is able to decrease the proliferation rate of the cells. This dataset includes the expression levels of the gene in wild type, knock down, and overexpressed cells and their proliferation rates. Each condition has 10 biological replicates.

This analysis includes the following steps:

  1. Create an R project.

  2. Create an R Markdown file.

  3. Load the dataaset into R.

  4. Analyze the loaded dataset:
    • Correlation between expression level and proliferation rate.
    • Compare expression level between conditions.
    • Compare proliferation rate between conditions.
  5. Output plots and tables.

Important: The things that you need to do are marked with ACTION.

1.6.1 Create an R project

ACTION: Create a new R project: Click menu File -> New Project -> Existing Directory -> select the directory of lab 1 handout as the project working directory (very important) -> create.

An R project is a session of RStudio, and the working directory is the directory where R interpreter starts running.

ACTION: After the project is created, type dir() and hit “enter” in your Console tab, you should see "cell_exp_proliferation.csv" included in the output together with some other files. The function dir() lists all files under the working directory of R.

The file named cell_exp_proliferation.csv is the dataset that we are going to analyze.

1.6.2 Create an R Markdown file in the project

ACTION: Create a new R Markdown file as described above, click menu File -> Save, and save it under lab 1 and 2 handout directory (very important).

The first section is your header, which is as following:

---
title: "Your Title"
author: "Your Name"
date: "Some Date"
output: html_document
---

Then it comes the first R code chunk named setup.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

An R code chunk is in the following format:

```{r Your Code Chunk Name, yourCodeChunkParameters=yourParamterValue}
# Your R code to be evaluated by R interpreter. 
print("Hello World!")
```

Only code in the code chunks will be executed. Everything outside code chunks will be formated by knitr as plain text.

A important behavior of knitr is that when knitting an R Markdown file (after clicking Knit), knitr starts a brand new R interpreter and evaluates all code chunks from beginning to the end. As a result, knitr will not be able to access the variables in the console R interpreter environment.

ACTION: Delete everything below this chunk, and we are going to create our own analysis R Markdown file.

1.6.3 Load dataset into R environment

ACTION: Add a level 1 header to notify the start of the data loading section by adding the following text:

# Load dataset into R environment

The # at the beginning indicates that the line is a level one header. When knitr sees the # it will format it as the level 1 header in the output report. Similarly ## indicates a level 2 header, ### indicates a level 3 header, etc. More information about the formatting conventions can be found at http://rmarkdown.rstudio.com/authoring_basics.html.

ACTION: Add the following code chunk into the created R Markdown file to load the cell_exp_proliferation.csv into your environment:

```{r load dataset}
# Load the data of cell_exp_proliferation.csv into the variable exp_prolif_df
# 'df' in the end indicates that the variable is a dataframe
# stringsAsFactors=FALSE indicates that we do not want to convert strings into factors
exp_prolif_df <- read.csv("cell_exp_proliferation.csv", stringsAsFactors=FALSE)

# Print the data stored in variable exp_prolif_df
exp_prolif_df
```

The dataset is stored in a file named cell_exp_proliferation.csv under the same directory as the handout. A comma separated values (CSV) format is commonly used to store a spreadsheet in plain text. CSV file can be opened by text editor, Excel, and RStudio. The units of proliferation rate and expression level are both arbitrary.

Each row is a biological replicate. The columns are the proliferation rate, expression level, and condition.

In this code chunk, we loaded cell_exp_proliferation.csv as a dataframe into R environment. “R environment” has been mentioned many times in this handout, and you might wonder what it really is.

1.6.3.1 R environment

An R environment is associated with each R interpreter session, and it includes all variables that you can access in that R interpreter session. Recall that when an R interpreter session started, only a minimum set of packages are imported into its environment.

In the code chunk above, after the interpreter executes exp_prolif_df <- read.csv("cell_exp_proliferation.csv", stringsAsFactors=FALSE), its environment will have an additional variable named exp_prolif_df. Then, we can print that variable in that interpreter session.

However, you cannot print exp_prolif_df in your console session even after knitting the current document, because knitr starts a new session when knitting an document.

Similarly, if you import a library in an interpreter session, all code and datasets included in its associated R environment, and you can use the imported code and datasets in that interpreter session. However, if you quit the current interpreter session and starts a new one, you will need to import that package again before using its code and datasets.

In addition to the functions and datasets, an R environment also keeps track of many system parameters such as the library location, and the most important one is the current working directory. You might wonder why it is very important to create the R project and R Markdown files in the lab handout directory, and the reason is that the locations of the R project and R Markdown file are the current working directory of the Console R interpreter session and knitr interpreter session respectively.

If your current working directory does not include cell_exp_proliferation.csv, you will receive an error when running exp_prolif_df <- read.csv("cell_exp_proliferation.csv", stringsAsFactors=FALSE), because R interpreter cannot find the file cell_exp_proliferation.csv.

You can change your current working directory (CWD) by clicking menu Session -> Set Working Directory -> Choose Directory.

1.6.3.2 File path

The locations of files on your computer are specified by their paths. Following are some examples on Windows, Mac, and Linux.

  • Windows path of a file named very_important_file.txt under the C disk directory:

    C:\very_important_file.txt
  • Windows path of a file named not_important_file.txt under a directory called outdated_files under C disk directory:

    C:\outdated_files\not_important_file.txt 
  • Mac path of file named very_important_file.txt under your Documents directory:

    /Users/yourUserName/Documents/very_important_file.txt
  • Mac path of file named not_important_file.txt under a directory called outdated_files under your Documents directory:

    /Users/yourUserName/Documents/outdated_files/not_important_file.txt
  • Linux path of file named very_important_file.txt under your Documents directory:

    /home/yourUserName/Documents/very_important_file.txt
  • Linux path of file named not_important_file.txt under a directory called outdated_files under your Documents directory:

    /home/yourUserName/Documents/outdated_files/not_important_file.txt

A path describes the exact method to find your file. For example if you want to find the not_important_file.txt:

  • Windows: open C disk -> open directory outdated_files.

  • Mac: open finder -> open Documents directory -> open directory outdated_files.

  • Linux: similar to Mac.

The way how knitr R interpreter finds cell_exp_proliferation.csv is through a path relative to CWD. read.csv("cell_exp_proliferation.csv") asks R interpreter to find cell_exp_proliferation.csv under CWD. You can check your CWD path by typing getwd() in Console tab and hit “enter”. If a path is specified without / (Mac/linux) DiskLetter:\ (Windows), it will be prepended with the path of CWD to form an absolute path, and R interpreter will further look up the absolute path. If the absolute path does not exist, an error message will be prompted stating that the file is not found.

Following is an example of how R failed to find cell_exp_proliferation.csv:

  • Absolute path of cell_exp_proliferation.csv: /home/youUserName/Documents/lab1/cell_exp_proliferation.csv.
  • CWD path of an R interpreter session: /home/youUserName/Documents/someOtherDirectory.
  • Relative path R tries to find when we running read.csv("cell_exp_proliferation.csv") on the R session: cell_exp_proliferation.csv.
  • Absolute path R tries to find when we running read.csv("cell_exp_proliferation.csv") on the R session: /home/youUserName/Documents/someOtherDirectory/cell_exp_proliferation.csv.
  • Because /home/youUserName/Documents/someOtherDirectory/cell_exp_proliferation.csv does not exist, an error message will be prompted.

There are two exceptoinal relative paths: * . refers to the CWD. * .. refers the directory at the immediate upper level of CWD.

For example, when current working directory is /home/youUserName/Documents/someOtherDirectory:

  • The absolute path of . is the same as CWD: /home/youUserName/Documents/someOtherDirectory
  • The absolute path of .. is the immediate upper level directory of CWD: /home/youUserName/Documents
  • The absolute path of ../someOtherDirectory is the same as CWD.
  • The absolute path of ./. is the same as CWD.

1.6.4 Analyze the loaded dataset

We are going to do the following two analyses:

  1. Check linear correlation between expression level and proliferation rate using linear regression.

  2. Test whether the means of expression levels are significantly different between conditions using t-test.

  3. Test whether the means of proliferation rate are significantly different between conditions using t-test.

1.6.4.1 Check linear correlation

ACTION: Add a level 1 header to notify the start of data anlysis section:

# Analyze the loaded dataset

ACTION: Add a level 2 header to notify the start of linear correlation analysis:

## Check linear correlation between gene expression level and cell proliferation rate

ACTION: Add the following code to create a scatter plot of expression level versus proliferation rate:

```{r plot linear correlation}
# Import the package ggplot2
library(ggplot2)

# Create a scatter plot of expression level versus proliferation with linear regression line
lin_cor_plot <- ggplot(data = exp_prolif_df,
                       mapping = aes(x = Proliferation.Rate, 
                                     y = Expression.Level)) +
  geom_point(aes(color = Condition)) +
  geom_smooth(method="lm", se = FALSE) +
  labs(x = "Proliferation Rate", y = "Expression Level", color = "Condition")

# Display the plot. The function call print() is optional.
print(lin_cor_plot)
```

ACTION: Knit the document by clicking Knit.

In this code chunk, we imported ggplot2 and created a scatter plot using it. We will talk more about making plots using ggplot2 in later lab sessions. The code is explained line by line as the following:

  • lin_cor_plot <- ggplot(data = exp_prolif_df, mapping = aes(x = Proliferation.Rate, y = Expression.Level)):
    • Note that a code expression can be written in multiple lines following certain rules.
    • lin_cor_plot <- ggplot(): The function ggplot() is called to make a new plot and store that plot in variable lin_cor_plot.
    • data = exp_prolif_df, specifies that the dataset we are plotting is variable exp_prolif_df.
    • mapping = aes(x = Proliferation.Rate, y = Expression.Level) specifies that we want to plot the column Proliferation.Rate in x-axis and the column Expression.Level in the y-axis.
    • NOTE: that there is a + at the end of line, which indicates that the plotting is not ended yet.
  • geom_point(aes(color = Condition)):
    • geom_point(): make a scatter plot.
    • aes(color = Condition): color the points according to the Condition column.
  • geom_smooth(method="lm", se = FALSE):
    • geom_smooth: add a regression line to the scatter plot.
    • method = lm: use linear regression. lm refers to the function lm() that is used to fit a linear model.
    • se = FALSE: do not plot the confidence intervals of the linear regression.
  • labs(x = "Proliferation Rate", y = "Expression Level", color = "Condition")
    • labs(): label the plot.
    • x = "Proliferation Rate": label x-axis as Proliferation Rate.
    • y = "Expression Level": label y-axis as Expression Level.
    • color = "Condition": label the color of points as condition.
    • NOTE: Because there is no + at the end of the line, this line concludes the code expression to create the plot.

ACTION: Add the following code chunk to print out the statistics of the linear regression:

```{r print linear correlation stats}
# Print the statistics of linear regression
summary(lm(Expression.Level ~ Proliferation.Rate, data = exp_prolif_df))
```

ACTION: Knit the document by clicking Knit.

Code explanation:

  • summary(lm(Expression.Level ~ Proliferation.Rate, data = exp_prolif_df)):
    • summary(): summarize lm(Expression.Level ~ Proliferation.Rate, data = exp_prolif_df)
    • lm(): fit a linear model.
    • Expression.Level ~ Proliferation.Rate: the formula of the linear model, which specifies that we want the linear model to be Expression.Level column versus Proliferation.Rate column.
    • data = exp_prolif_df: the dataset is variable exp_prolif_df.

The p-value of the linear regression will be printed in the last line of the linear model stats.

1.6.4.2 Test gene expression level differences

The purpose of this analysis is to validate that the gene is knocked down or overexpressed.

ACTION: Add a level 2 header to notify the start of expression level comprisons:

## Compare gene expression levels between different conditions

ACTION: Add a code chunk to create a boxplot of the expression levels of all conditions:

```{r plot gene exp}
# Create boxplot of gene expression levels
exp_boxplot <- ggplot(data = exp_prolif_df, 
                      aes(x = Condition, y = Expression.Level, 
                          color = Condition)) + 
  geom_boxplot() +
  geom_dotplot(aes(fill = Condition), binaxis='y', 
               stackdir='center', binwidth = 1, dotsize = 0.75) +
  labs(x = 'Condition', y = 'Expression Level')

exp_boxplot
```

ACTION: Knit the document by clicking Knit.

Each box refers to a condition, and the dots are the measurements stacked into bins.

Code explanation:

  • ggplot(data = exp_prolif_df, aes(x = Condition, y = Expression.Level, color = Condition)) +: similar as previously mentioned.
  • geom_boxplot() +: make a boxplot
  • geom_dotplot(aes(fill = Condition), binaxis='y', stackdir='center', binwidth = 1, dotsize = 0.75) +:
    • geom_dotplot(): make a dotplot on top of the box plot.
    • aes(fill = Condition): fill the dots according to their conditions.
  • labs(x = 'Condition', y = 'Expression Level'): similar as previously mentioned.

ACTION: Add a code chunk to test the difference between each condition using t-test.

```{r test exp mean diff, message=FALSE}
library(dplyr)
# Select the expression levels of each condition and store them into individual variables. 
wt_exp <- exp_prolif_df[exp_prolif_df$Condition == 'Wild Type', 'Expression.Level']
kd_exp <- subset(exp_prolif_df, Condition == 'Knock Down', select = 'Expression.Level', drop = TRUE)
oe_exp <- filter(exp_prolif_df, Condition == 'Overexpression') %>% pull(Expression.Level)

# knock down versus wild type
t.test(x = kd_exp, y = wt_exp, alternative = 'l')

# wild type versus overexpression
t.test(x = wt_exp, y = oe_exp, alternative = 'l')

# knock down versus overexpression
t.test(x = kd_exp, y = oe_exp, alternative = 'l')
```

ACTION: Knit the document by clicking Knit.

The p-values will be printed out.

Code explanation:

We first selected the expression levels of the three conditions using three different methods:

  • wt_exp <- exp_prolif_df[exp_prolif_df$Condition == 'Wild Type', 'Expression.Level']: This line of code used boolean and direct indexing.
    • The syntax of indexing is: myDataFrameOrMatrix[rowIndicesToSelect, columnIndicesToSelect] .
    • There are many types of indices, such as number, column name, and boolean.
    • exp_prolif_df$Condition == 'Wild Type': boolean indexing. We want the rows of exp_prolif_df where the column Condition values are equal to the string "Wild Type". exp_prolif_df$Condition selects the Condition column of exp_prolif_df.
    • 'Expression.Level': column name indexing.
  • kd_exp <- subset(exp_prolif_df, Condition == 'Knock Down', select = 'Expression.Level', drop = TRUE): This line of code used the function subset(). Check the documentation of subset to look for more details.

  • oe_exp <- filter(exp_prolif_df, Condition == 'Overexpression') %>% pull(Expression.Level): This line of code used the package dplyr. Go to http://dplyr.tidyverse.org/index.html for more details.

The operator %>%, which is defined in the package dplyr, passes the return value of previous expression into the parameter of the next function call. For example, "Hello World!" %>% print() is equivalent to print("Hello World"). The main purpose of introducing the operator %>% is to improve the readability of R code.

Then, we used t-test to test for the differences of expression level between conditions.

  • t.test(x = wt_exp, y = oe_exp, alternative = 'l')
    • x = wt_exp, y = oe_exp: compare values in wt_exp and oe_exp.
    • alternative = 'l': alternative hypothesis of this test is that the mean of wt_exp is less than oe_exp.
    • Look at the documentataion of t.test for more details.

When performing t-test, we selected the expression levels and stored them into differrent variables. This procedure belongs to “data manipulation”.

1.6.4.2.1 Data manipulation

Data manipulation is a very general practice when doing data analysis, in which data in one form are turned into another form in order to complete the following analysis. For example, in this analysis, we have to extract the expression values from exp_prolif_df to complete t-tests.

There are basically 3 ways of data manipulation:

  • Select or filter: select certain data points of interest from a bigger dataset.

  • Summarize: calculate summary statistics, such mean and standard deviation, of multiple data points.

  • Reshape: change the “design” of a table. There are two types of table “design”, “wide” and “long” format. One format can be “reshaped” into another using packages like reshape2 and tidyr.

    • “Wide” table:

      geneName expressionInCellLineA expressionInCellLineB expressionInCellLineC
      GeneA 1 20 5
      GeneB 5 5 20
      GeneC 10 15 30
    • “Long” table:

      geneName cellLine expressionLevel
      GeneA expressionInCellLineA 1
      GeneB expressionInCellLineA 5
      GeneC expressionInCellLineA 10
      GeneA expressionInCellLineB 20
      GeneB expressionInCellLineB 5
      GeneC expressionInCellLineB 15
      GeneA expressionInCellLineC 5
      GeneB expressionInCellLineC 20
      GeneC expressionInCellLineC 30
    • melt/gather/stack: wide format -> long format.
    • dcast/spread/unstack: long format -> wide format.

An analysis might be more easily done if the data is in either wide or long format. For more details, check this introduction to the R package reshape2, http://seananderson.ca/2013/10/19/reshape.html .

1.6.4.3 Test cell proliferation rate differences

NOTE: I encrouage you to write the code chunks of this section yourself before using the code we provide below, because this analysis is the same as the previous section on expression level analysis.

ACTION: Add a level 2 header to notify the start of proliferation rate comprisons:

## Compare proliferation rates between different conditions

ACTION: Add a code chunk to create a boxplot of the proliferation rates of all conditions:

```{r plot proliferation rate}
# Create boxplot of proliferation rates
prolif_boxplot <- ggplot(data = exp_prolif_df, 
                         aes(x = Condition, y = Proliferation.Rate, 
                             color = Condition)) + 
  geom_boxplot() +
  geom_dotplot(aes(fill = Condition), binaxis='y', 
               stackdir='center', binwidth = 1, dotsize = 0.75) +
  labs(x = 'Condition', y = 'Proliferation Rate')

prolif_boxplot
```

ACTION: Knit the document by clicking Knit.

ACTION: Add a code chunk to test the difference between each condition using t-test.

```{r test proliferation rate diff}
# Select the proliferation rates of each condition and store them into individual variables. 
wt_prolif <- exp_prolif_df[exp_prolif_df$Condition == 'Wild Type', 'Proliferation.Rate']
kd_prolif <- subset(exp_prolif_df, Condition == 'Knock Down', select = 'Proliferation.Rate', drop = TRUE)
oe_prolif <- filter(exp_prolif_df, Condition == 'Overexpression') %>% pull(Proliferation.Rate)

# knock down versus wild type
t.test(x = kd_prolif, y = wt_prolif, alternative = 'g')

# wild type versus overexpression
t.test(x = wt_prolif, y = oe_prolif, alternative = 'g')

# knock down versus overexpression
t.test(x = kd_prolif, y = oe_prolif, alternative = 'g')
```

ACTION: Knit the document by clicking Knit.

1.6.5 Output plots and tables

Finally, we output the plots and tables for the record.

ACTION: Add a level 1 header to indicate the start of output section.

# Output results and plots

ACTION: Add a code chunk to save the plots we made.

```{r output plots, results='hide'}
# Save plot in a PDF file named cell_prolif_exp_plots.pdf
# You can also output your plots **individually** in other formats using other 
# functions, such as bmp(), jpeg(), and png(). Check the documentation of the 
# png() function for more details.
pdf('cell_prolif_exp_plots.pdf')

# Now, the following plots will be printed into cell_prolif_exp_plots.pdf .
lin_cor_plot
exp_boxplot
prolif_boxplot
# dev.off() tells R that we are done saving plots into cell_prolif_exp_plots.pdf .
dev.off()
```

ACTION: Knit the document by clicking Knit.

You should see a PDF file named cell_prolif_exp_plots.pdf generated in your CWD. You can open it using your file browser.

ACTION: Add a code chunk to calculate the mean and standard deviation of the data and output it into a table.

```{r output stats, results='hide'}
# Calculate summary statistics of all conditions
exp_prolif_stats_df <- exp_prolif_df %>% 
  group_by(Condition) %>% 
  summarise(expLevelMean = mean(Expression.Level),
            expLevelSd = sd(Expression.Level),
            prolifRateMean = mean(Proliferation.Rate),
            prolifRateSd = sd(Proliferation.Rate))

# Output the summary statistics table
write.csv(exp_prolif_stats_df, 'exp_prolif_stats.csv',
          row.names = FALSE)
```

ACTION: Knit the document by clicking Knit.

You should see a CSV file named exp_prolif_stats.csv generated in your CWD. You can open it using text editor or Excel.

This concludes our data analysis.

2 Optional

The optional topics will only be discussed in the lab session if it ends too early. The topics are selected by the TAs to help you better understand programming and statistics. If you are not interested in any optional topic, you can safely ignore it.

2.1 Run Linux operating system on a virtual machine

In this section, we will discuss the pros and cons of using Linux operating system. If you are interested, we are going to create a virtual machine on your computer to run Linux. Don’t worry, we are not going to ask you to erase all your data on your hard drive.

Linux operating system is developed based on the Unix operating system, so as the Mac OS. Following is the pedigree of Unix based operating system downloaded from https://en.wikipedia.org/wiki/MacOS:

Unix based operating system

Unix based operating system

On the other hand, Microsoft Windows is not based on Unix.

2.1.1 Unix based operating system is better for programming in the biological field

You might wonder what is the difference between Linux, Mac OS, and Windows, because they all look the same. Even when you program using an IDE, such as RStudio, things might look the same to you as well. Thus, why bother switching to Linux.

Firstly, it is not very easy to run some bioinformatics tools on Windows. For example, a popular aligner STAR only provides ready-to-run executable files for Linux and Mac OS. If you would like to run it on Windows, you need to compile the executable from source code, which is sometimes very difficult.

In addition, bash script, https://en.wikipedia.org/wiki/Bash_(Unix_shell), is very popular among bioinformaticians to do some simple data analysis, but it can only be easily run on Unix based systems. You might find answers to your programming questions written in bash script.

Most importantly but might not be very familiar to you, the high performance computing servers or clusters are mostly running Linux operating system. If you want to use these resources, you need to learn bash script.

Following is a list of a few general advantages and disadvantages of Unix based systems,

  • Advantages:
    • Most educational resources of programming use Unix based operating systems.
    • Free.
    • Fast.
    • Almost never freeze.
  • Disadvantages:
    • It is easier to wreck your system by a mistake, and it will be very time consuming or impossible to recover.
    • Have to learn a lot of things to get started.
    • Takes some time to get familiar with.

We can easily prevent wrecking our real operating system by using a virtual machine. If you mistakenly wrecked the virtual machine, just delete everything in it and create a new one.

2.1.2 Run Linux on virtual machine

A virtual machine is a guest operating system (OS) running on the host OS through virtualizing the hardware resources. When you start your computer, the host OS runs on the hardware of your computer, such as CPU, memory, and network card. When using virtual machine monitor (VMM), you are able to run a guest OS of any kind on your host OS. The VMM uses software resources provided by host OS to simulate a set of hardware resources for guest OS to run. In the following illustration, real and virtual hardware layers are labeled in blue, and software layers are labeled in yellow.

There are many VMM applications, and the most popular ones are costly VMware products and free VirtualBox.

Following are the instructions to use VirtualBox for running Linux as a virtual machine:

  1. Download VirtualBox installer from https://www.virtualbox.org/. Click Download. Download VirtualBox platform package according to your operating system.
  1. Open the downloaded file and follow the instructions to install. When prompted to install some other stuff, choose install.
  2. Download Ubuntu 16.04.3 LTS image from https://www.ubuntu.com/download/desktop. Wait for the file ubuntu-16.04.3-desktop-amd64.iso to be downloaded. Important: make sure that the there is amd64 in the file name you are downloading, and it means that the image is a 64-bit version.
  3. Open the installed VirtualBox software. You should see a window similar as following.
  1. Click New in the window. Fill the name with ubuntuVM. Important: make sure that the type is Linux and Version is Ubuntu (64-bit). Click Continue. Troubleshoot: If you cannot find the 64-bit option, try google for 32-bit ubuntu image and downloaded it. Then, in step 16, use the 32-bit image.
  1. Click Continue to set the memory size of the virtual machine we are creating as recommended.

  2. Choose Create a virtual hard disk now and click Create. This creates a file on your host OS to store the files in guest OS.

  1. Choose VDI (VirtualBox Disk Image) and click Continue.
  1. Choose Dynamically allocated and click Continue. It worth taking time to read through the explanation in the window.
  1. Choose the location of the guest virtual machine to be stored.
  2. Change the size of the virtual hard disk to 20-50GB depending on the free space on your computer. Click Create.
  1. You should see the newly created virtual machine in the list.
  1. Selected the newly created virtual machine and click Settings.
  1. Select Storage. Click the Empty under Controller: IDE.
  1. Click the small disk label in the marked with red rectangle in the previous image.
  1. Select Choose Virtual Optic Disk File... and choose the Ubuntu image you downloaded in step 3. You should see the information updated similarly as following.
  1. Click OK.

  2. Select the virtual machine and click Start.

  1. Click Install Ubuntu.

  2. Select Download updates when installing and continue.

  3. Select Erase disk and install Ubuntu. Don’t worry. This will not erase your real hard drive. This is only going to erase the virtual disk we created.

  4. Click Continue. Set up your time zone. Choose language. You might need to enlarge the window to click Continue.

  5. Set username and password. Important: remember your password. Click Continue. Then, wait for the guest OS to be installed on your virtual machine.

  6. Once the installation is done, click Restart Now. Similarly, this will only restart the virtual machine but not your real machine.

  7. When you see the instruction to remove the installation medium, remove the Ubuntu image from the virtual optic drive. Similar to steps 13 - 17. By default, VirtualBox should have removed the image.

  8. Enter the password to log in. After this step, your virtual machine is up and running.

Note: The following steps are optional, which allows you to copy and paste text between host and guest. This is usually a very useful feature.

  1. Click the window running the virtual machine.

  2. Click Install guest additions CD image.... You should see a disk show up in your virtual machine and click Run. Then type in the password you set up for your virtual machine to authenticate.

  1. Wait for the installation to be done. Following the instruction to hit return to close the window.

  2. Click menu Device -> Shared Clipboard -> Bidirectional. This will work after restarting the virtual machine.

  3. Restart your virtual machine by clicking the upper right corner and select Shutdown. Then choose restart at on the left.

  1. Try copy some text on your host machine and paste to your guest machine Firefox browser. It should work if everything went well. I have tested it on both Mac OS 10.12.6 (16G29) and Windows 10. Trouble shoot: If this does not work, try restart virtual box.

Enjoy!